Skip to content

feat!: pluggable secrets/connection backends (k8s, infisical, aws)#135

Merged
opzkit-auto-merge[bot] merged 13 commits into
mainfrom
refactor/secrets-backend-interface
May 7, 2026
Merged

feat!: pluggable secrets/connection backends (k8s, infisical, aws)#135
opzkit-auto-merge[bot] merged 13 commits into
mainfrom
refactor/secrets-backend-interface

Conversation

@peter-svensson

@peter-svensson peter-svensson commented May 4, 2026

Copy link
Copy Markdown
Member

Summary

Refactors the operator's credential-storage layer from AWS-Secrets-Manager-only into a pluggable secrets.Backend interface, then adds two new implementations alongside the existing AWS one. Uses the seam to clean up the CRD shape (one breaking CRD change), shrink the controller, and surface a generic status field.

After this PR:

spec:
  connectionString:
    aws | kubernetes:        # admin DSN source — pick one
      ...
  secretBackend:
    aws | kubernetes | infisical:   # generated-credential destination — pick one
      ...

Commits

  • e9d5fe9 refactor(secrets): introduce Backend interface + factoryBackend interface + secrets.NewBackend, AWS adapter methods. No behaviour change.
  • 28aa53c refactor(secrets): replace SecretARN+SecretRegion with SecretLocator — single generic status field; AWS region migration now derives the previous region from the stored ARN via secrets.AWSRegionFromARN.
  • d7f4c2e chore(crds): regenerate from controller-gen v0.19.0make manifests && make helm-crds artefact-only refresh.
  • 158fda5 refactor!(api): nest backends under spec.secretBackend / spec.connectionStringBREAKING. Drops spec.awsSecretsManager, spec.connectionStringSecretRef, spec.connectionStringAWSSecretRef in favour of the discriminated unions above.
  • 1ccba93 feat(secrets): add Kubernetes Secret backend.
  • cb4a922 refactor(controller): route destination secret ops through Backend.
  • f08eaf7 feat(secrets): add Infisical Cloud backend.
  • 0b97bf9 fix(controller): address PR #135 smoke-test blockers — five fixes (leader-election manifest, kubernetes-backend-friendly default secretName, RBAC verbs, recovery from missing-secret state, CURRENT_USER → literal-role GRANT).
  • ea83a13 docs: update CRD examples and prose for new spec.connectionString/secretBackend.
  • 3b772f9 fix(database): grant admin role WITH INHERIT, SET for PG 16+ — addresses sixth smoke-test blocker on Scaleway/self-hosted PG 16+ (must be able to SET ROLE, 42501). Adds WITH INHERIT TRUE, SET TRUE to the admin grant; PG <16 honours INHERIT and ignores SET.
  • 09304de docs: document supported database versions — README now lists PG 14+ (tested 15/16/17), MySQL 8.0+, MariaDB 10.6+; clarifies AWS creds only needed for AWS backends.

Breaking changes

CRD field renames (old fields removed)

Old (v0.1.x) New (this PR)
spec.connectionStringSecretRef.{name,key} spec.connectionString.kubernetes.{name,key}
spec.connectionStringAWSSecretRef.{secretName,region,key} spec.connectionString.aws.{secretName,region,key}
spec.awsSecretsManager.{region,description,tags} spec.secretBackend.aws.{region,description,tags}

Status field rename: secretARN, secretRegion → single generic secretLocator string.

Manager flag removal

--leader-elect is no longer registered by the binary. LeaderElection: true is hardcoded in cmd/manager/main.go (always-on for safe multi-replica operation). The kustomize-deployed config/manager/manager.yaml no longer passes the flag. Helm values never passed it, so chart users are unaffected.

Action for kustomize users overriding the arg list: drop --leader-elect from your patch. Passing the flag will crash the manager (flag provided but not defined: -leader-elect).

New dependency footprint

feat(secrets): add Infisical Cloud backend adds github.com/infisical/go-sdk v0.7.1, which transitively pulls in google.golang.org/api, OpenTelemetry SDKs, and grpc updates. Substantial. The SDK is the official path for Universal Auth + V3 secret CRUD; switching to a hand-rolled REST client is on the table for v0.3 if the dep weight is unacceptable.

Test plan

  • go build ./...
  • go test ./... — controller, database, secrets all green; Kubernetes backend tested via controller-runtime fake client; Infisical backend tested via in-memory fake implementing InfisicalSecretClient.
  • make manifests && make helm-crds produces no diff
  • AWS Secrets Manager — regression smoke against an RDS-style provider with spec.secretBackend.aws and spec.connectionString.aws.
  • UpCloud Managed PostgreSQL — smoke against UpCloud's managed PG (TLS-required, restricted admin role) with spec.connectionString.aws and any secretBackend.
  • Scaleway Managed RDB — smoke against Scaleway managed PostgreSQL (validates the CURRENT_USER GRANT fix, the PG 16+ WITH INHERIT, SET grant, and the _rdb_superadmin trigger interaction).
  • AWS RDS PostgreSQL 17 — smoke against PG 17 to confirm the upper bound of the documented support range.
  • spec.secretBackend.kubernetes — full create/update/delete cycle against any of the above DB providers with credentials stored as a Kubernetes Secret.
  • spec.secretBackend.infisical — full cycle against Infisical Cloud with Universal Auth.

Follow-ups

  • Phase 4: tag v0.2.0 (CHANGELOG, Helm chart bump, README).
  • Webhook / CEL validation for spec.secretBackend and spec.connectionString oneOf invariants (currently controller-side via secrets.NewBackend).
  • Reconsider Infisical SDK weight; possibly replace with a small REST client.
  • Stamp version/gitCommit/buildDate into the binary via -ldflags and log on startup (raised in PR review — separate PR to keep this diff focused).
  • cmd/backup-runner (untracked WIP) needs its own backend lookup once it lands.

@peter-svensson

Copy link
Copy Markdown
Member Author

Heads up — smoke-testing this branch's image (ghcr.io/opzkit/database-user-operator:pr-135) on a fresh cluster via make deploy results in CrashLoopBackOff:

flag provided but not defined: -leader-elect
Usage of /manager:
  -health-probe-bind-address string
  -kubeconfig string
  -metrics-bind-address string
  -zap-devel
  ...

config/manager/manager.yaml:33 still passes --leader-elect, but cmd/main.go no longer registers that flag (looks like leader election was dropped or its registration was removed during the refactor). The binary aborts on the unknown flag before the manager starts.

Fixes I can see (in order of preference):

  1. Re-add leader-election registration in cmd/main.go — restore the flag + the LeaderElection: enableLeaderElect field on the manager.Options{} literal. This matches the controller-runtime template and keeps the manifest correct.
  2. Drop --leader-elect from config/manager/manager.yaml if leader election is intentionally gone for now — but then HA replicas won't coordinate; not what you want.

Local workaround for anyone else hitting this:

kubectl -n database-user-operator-system patch deployment database-user-operator-controller-manager \
  --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/args"}]'

Happy to send a small follow-up PR if you'd like — just confirm whether you want option 1 or 2.

@peter-svensson

Copy link
Copy Markdown
Member Author

Two more findings from the smoke test on Scaleway managed PG via secretBackend.kubernetes:

1. Default secretName produces an invalid Kubernetes Secret name

With secretBackend.kubernetes: {} and no explicit spec.secretName, the default is rds/<engine>/<databaseName> (e.g. rds/postgres/dbuo_smoketest). The Kubernetes backend then tries to delete/create a Secret by that name and fails:

failed to delete kubernetes secret: invalid resource name "rds/postgres/dbuo_smoketest": [may not contain '/']

The default was AWS-SM-friendly (path-shaped names are fine in Secrets Manager), but it isn't valid for the Kubernetes backend. The backend selection should either:

  • override the default name to a DNS-1123-compatible form (e.g. replace / with -), or
  • validate at admission and refuse the resource with a clear error rather than failing in the reconcile loop.

2. Stale-state recovery doesn't help when partial reconciles leave orphan users

When the first reconcile creates the PG user but the Secret-write step fails (e.g. connection-string parsing problem on a previous attempt left the user behind, then we fixed the DSN), the next reconcile errors with:

database and/or user exist but secret is missing - cannot recover password
(database exists: false, user exists: true, secret exists: false).
Please delete the Database CR and recreate it, or manually create the secret with the correct password

The suggestion to "delete the Database CR and recreate it" doesn't actually unblock anyone in the kubernetes-backend case until #1 above is fixed, because the finalizer's secret-delete step itself fails on the same /-in-name validation, leaving the CR in a perma-deleting state until the finalizer is patched out.

Workaround for #2 in the meantime: kubectl patch database <name> -p '{"metadata":{"finalizers":[]}}' --type=merge to release the CR, then DROP the orphan PG user manually before recreating with an explicit spec.secretName set to a DNS-1123 string.

Both feel like blockers for the kubernetes backend path. Happy to send a follow-up PR for #1 if you'd like — likely a small change in secrets.NewBackend (or wherever the secretName default is computed) to derive backend-aware defaults.

@peter-svensson

Copy link
Copy Markdown
Member Author

Third finding: the RBAC marker for Secrets in internal/controller/database_controller.go:55 only grants get;list;watch:

// +kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch

The Kubernetes backend needs to create, update, and delete Secrets too — currently the controller fails the cleanup path with:

failed to delete kubernetes secret: secrets "dbuo-smoketest-creds" is forbidden:
User "system:serviceaccount:database-user-operator-system:database-user-operator-controller-manager"
cannot delete resource "secrets" in API group "" in the namespace "dbuo-test"

Suggested fix:

// +kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch;create;update;patch;delete

(Same marker can stay cluster-scoped if secretBackend.kubernetes.namespace may differ from the Database's namespace. If we want to scope it down later, swap to a Role + RoleBinding generated from the same marker via +kubebuilder:rbac:...,namespace=... — but that's a bigger refactor.)

I'll send a small follow-up PR with the marker change unless you'd rather fold it into this one.

@peter-svensson

Copy link
Copy Markdown
Member Author

Fourth finding (root cause of the catch-22 we kept hitting earlier): on Scaleway managed PostgreSQL, the GRANT step in internal/database/postgres.go:205 fails:

failed to grant user to admin: pq: cannot use special role specifier in GRANT/REVOKE ROLE (XX000)

The offending statement:

grantQuery := fmt.Sprintf("GRANT %s TO CURRENT_USER", quoteIdentifier(username))

CURRENT_USER is a special role specifier in Postgres (alongside CURRENT_ROLE, SESSION_USER, PUBLIC). Some managed Postgres providers — Scaleway managed RDB confirmed, RDS reportedly intermittent depending on parameter group — reject it as a role-recipient and raise XX000 from the role-resolution layer.

Suggested fix: resolve CURRENT_USER to the actual connection role name once at controller startup (or once per connection) and substitute the literal role name into the GRANT:

var adminRole string
if err := c.db.QueryRowContext(ctx, "SELECT current_user").Scan(&adminRole); err != nil {
    return fmt.Errorf("failed to resolve current admin role: %w", err)
}
grantQuery := fmt.Sprintf("GRANT %s TO %s", quoteIdentifier(username), quoteIdentifier(adminRole))

This also gives nicer error messages when GRANT fails, since the literal role name appears in the SQL.

Combined with the earlier findings (catch-22 logic when secret-create fails, /-in-default-secretName for kubernetes backend, missing RBAC verbs, stale --leader-elect flag), the kubernetes backend currently can't complete a single Database reconcile against Scaleway managed PG.

Happy to put together a PR rolling all five fixes if you'd prefer them bundled, or send each separately.

@peter-svensson

Copy link
Copy Markdown
Member Author

Confirmed the fix locally — replacing CURRENT_USER with the resolved role name lets the GRANT succeed against Scaleway managed RDB. Patch:

--- a/internal/database/postgres.go
+++ b/internal/database/postgres.go
@@ -202,11 +202,14 @@ func (c *PostgresClient) CreateUser(ctx context.Context, username, password str
 			return fmt.Errorf("failed to create user: %w", err)
 		}
 
-		// Grant the new user to the current admin user so we can SET ROLE to it
-		// This is required for CREATE DATABASE ... OWNER to work in managed PostgreSQL (RDS, etc)
-		grantQuery := fmt.Sprintf("GRANT %s TO CURRENT_USER", quoteIdentifier(username))
-		_, err = c.db.ExecContext(ctx, grantQuery)
-		if err != nil {
-			return fmt.Errorf("failed to grant user to admin: %w", err)
+		// Grant the new user to the current admin role so we can SET ROLE to it.
+		// `CURRENT_USER` is rejected as a role-recipient by some managed PG providers
+		// (Scaleway managed RDB raises XX000 "cannot use special role specifier"),
+		// so resolve it to a literal role name first.
+		var adminRole string
+		if err := c.db.QueryRowContext(ctx, "SELECT current_user").Scan(&adminRole); err != nil {
+			return fmt.Errorf("failed to resolve current admin role: %w", err)
+		}
+		grantQuery := fmt.Sprintf("GRANT %s TO %s", quoteIdentifier(username), quoteIdentifier(adminRole))
+		if _, err := c.db.ExecContext(ctx, grantQuery); err != nil {
+			return fmt.Errorf("failed to grant user to admin (%s): %w", adminRole, err)
 		}
 	}

Side-note while testing: on Scaleway managed RDB the _rdb_superadmin trigger auto-grants membership in the new role to the connecting admin during CREATE USER, so the GRANT step ends up a no-op there (the literal-name version succeeds with a NOTICE: ... has already been granted membership ...). The fix doesn't need to be conditional — the GRANT remains correct on RDS-style providers, and idempotent on Scaleway. Just worth knowing for the test matrix.

So in summary, the five blockers found while smoke-testing this branch end-to-end (Scaleway managed RDB + secretBackend.kubernetes):

  1. --leader-elect flag in config/manager/manager.yaml no longer registered by the binary.
  2. Default secretName (rds/<engine>/<db>) contains /, invalid for the kubernetes backend.
  3. RBAC marker for Secrets only grants get;list;watch; needs create;update;patch;delete for the kubernetes backend.
  4. Catch-22 reconcile path when userExists && !secretExists is unrecoverable for the kubernetes backend (compounded by 1–3 leaving partial state behind).
  5. GRANT %s TO CURRENT_USER rejected by some managed PG providers — needs literal role name.

Fix #5 alone unblocks the happy path (smoke test reaches "user + database + secret created"); the other four make the failure modes much friendlier.

@peter-svensson

Copy link
Copy Markdown
Member Author

Unrelated suggestion while we're in here: the controller has no version banner on startup, which makes triage painful when running unreleased builds (it took us a few minutes to confirm which exact image we were looking at while debugging the GRANT issue, since pr-135 and a follow-up pr135-current-user-fix rebuild looked identical from the outside).

Standard kubebuilder pattern: stamp version, gitCommit, and buildDate into the binary at link time, then log them once at startup.

// cmd/main.go (top-level vars, populated by -ldflags)
var (
    version   = "dev"
    gitCommit = "none"
    buildDate = "unknown"
)

// inside main(), right before manager start:
setupLog.Info("starting database-user-operator",
    "version", version,
    "gitCommit", gitCommit,
    "buildDate", buildDate,
)

Dockerfile (or the Makefile target that calls go build):

ARG VERSION=dev
ARG GIT_COMMIT=none
ARG BUILD_DATE=unknown
RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} go build \
    -a -ldflags "-X main.version=${VERSION} -X main.gitCommit=${GIT_COMMIT} -X main.buildDate=${BUILD_DATE}" \
    -o manager cmd/main.go

build-and-scan.yaml already has ${{ steps.meta.outputs.version }} and ${{ github.sha }} available; pass them through as build args. BUILD_DATE can be $(date -u +%Y-%m-%dT%H:%M:%SZ) set in a prior step.

Ergonomic bonus: same vars can back a --version CLI flag and a /version HTTP endpoint on the existing health-probe listener if you want it scrapeable.

peter-svensson added a commit that referenced this pull request May 7, 2026
Five fixes surfaced by smoke testing the secretBackend.kubernetes path
against Scaleway managed PostgreSQL:

1. config/manager/manager.yaml: drop --leader-elect arg
   The flag was no longer registered (LeaderElection is hardcoded true
   for safe multi-replica operation), so the binary aborted on the
   unknown flag at startup. Removing the arg matches the new behaviour.
   BREAKING for kustomize users overriding the arg list.

2. internal/controller/database_controller.go: backend-aware default
   for secretName. Kubernetes Secret names reject '/'; the rds/<eng>/<db>
   path-shape was AWS-SM-friendly only. Use rds-<eng>-<db> for the
   kubernetes backend, keep slashes for AWS / Infisical.

3. RBAC marker: secrets verbs now include create/update/patch/delete.
   The kubernetes backend needs to write/delete Secrets, not just read.
   Regenerated config/rbac/role.yaml via 'make manifests'.

4. database_controller.go recovery: when userExists && !secretExists
   (and not a region-change), rotate the password via CreateUser
   (idempotent ALTER USER WITH PASSWORD) and recreate the secret
   instead of aborting with 'cannot recover password'. Previously left
   CRs perma-stuck because the kubernetes backend's finalizer also
   failed on the '/'-in-name issue from #2.

5. internal/database/postgres.go: resolve CURRENT_USER to a literal
   role name before issuing GRANT. Some managed PG providers (Scaleway
   managed RDB confirmed) reject 'GRANT role TO CURRENT_USER' with
   XX000 'cannot use special role specifier'. The literal-name form
   succeeds on RDS and on Scaleway (idempotent — Scaleway's
   _rdb_superadmin trigger already grants membership at CREATE USER).
@peter-svensson peter-svensson requested a review from argoyle as a code owner May 7, 2026 09:33
@peter-svensson

Copy link
Copy Markdown
Member Author

Pushed bf1a8f6 resolving the five smoke-test blockers (and one related cleanup):

  1. --leader-elect crash — went with the simpler path of hardcoding LeaderElection: true in cmd/manager/main.go and dropping the --leader-elect arg from config/manager/manager.yaml. Documented as breaking for any kustomize users overriding the arg list. Helm values.yaml never passed the flag, so chart users are unaffected. docs/TROUBLESHOOTING.md example updated to drop the flag.

  2. Default secretName invalid for kubernetes backendgetSecretNameOrDefault is now backend-aware. AWS / Infisical keep rds/<engine>/<db>; kubernetes uses rds-<engine>-<db> (DNS-1123 compliant). Existing tests still pass since they don't set the kubernetes backend.

  3. RBAC verbs — applied your suggested marker verbs=get;list;watch;create;update;patch;delete, regenerated config/rbac/role.yaml.

  4. Catch-22 when userExists && !secretExists — replaced the unrecoverable error with a recovery path: rotate password via CreateUser (idempotent — ALTER USER WITH PASSWORD when the user exists), create the database if missing, fall through to the normal secret-write step. Safe because the operator owns the user (stamped by COMMENT ON ROLE at create time).

  5. GRANT … TO CURRENT_USER rejected by managed PG — applied your patch verbatim. SELECT current_user resolves the literal role, then GRANT %s TO %s. Confirmed idempotent on Scaleway's _rdb_superadmin-trigger setup.

Version banner via -ldflags is a good suggestion — filing as a separate follow-up rather than folding into this PR to keep the diff focused. Will track once #135 lands.

PR body's "Migration: see Breaking changes" section unchanged; the leader-election arg removal is now also breaking for raw-kustomize users — will add a note to the body.

@opzkit-auto-merge opzkit-auto-merge Bot enabled auto-merge (squash) May 7, 2026 09:41
@peter-svensson peter-svensson changed the title feat(secrets): pluggable secret backends — Kubernetes Secret + Infisical Cloud (BREAKING) feat!: pluggable secrets/connection backends (k8s, infisical, aws) May 7, 2026
@peter-svensson

Copy link
Copy Markdown
Member Author

Sixth blocker (only surfaced once #5's CURRENT_USER patch was in place): on PG 16+ the GRANT defaults changed to NOINHERIT, NOSET, NOADMIN, and CREATE DATABASE x OWNER y / ALTER DATABASE x OWNER TO y now require the executor to be a member of y with the SET privilege. Bare membership is no longer enough.

Reproduces on Scaleway managed RDB (PG 16) — the operator gets to the create-database step and errors:

failed to create database: pq: must be able to SET ROLE "dbuo_smoketest" (42501)

It doesn't reproduce on AWS RDS / Aurora — even on PG 17 — because RDS/Aurora ship an event-trigger on CREATE ROLE that auto-runs something like GRANT new_role TO rds_superuser WITH INHERIT, SET, ADMIN OPTION, papering over PG 16's stricter defaults. Scaleway managed RDB has a similar trigger (_rdb_superadmin shows up in NOTICE: ... has already been granted membership ... by role "_rdb_superadmin"), but it only grants bare membership — no flags. So DBUO works on RDS today, breaks on Scaleway, and would also break on a self-hosted PG 16+ without the same trigger.

Verified the patch:

GRANT dbuo_smoketest TO intersolia_admin WITH INHERIT TRUE, SET TRUE;
-- then
CREATE DATABASE dbuo_smoketest_alt OWNER dbuo_smoketest;  -- succeeds

Suggested fix — combine with #5, since both touch the same GRANT statement:

--- a/internal/database/postgres.go
+++ b/internal/database/postgres.go
@@ -202,11 +202,21 @@ func (c *PostgresClient) CreateUser(ctx context.Context, username, password str
 			return fmt.Errorf("failed to create user: %w", err)
 		}
 
-		// Grant the new user to the current admin user so we can SET ROLE to it
-		// This is required for CREATE DATABASE ... OWNER to work in managed PostgreSQL (RDS, etc)
-		grantQuery := fmt.Sprintf("GRANT %s TO CURRENT_USER", quoteIdentifier(username))
-		_, err = c.db.ExecContext(ctx, grantQuery)
-		if err != nil {
-			return fmt.Errorf("failed to grant user to admin: %w", err)
+		// Grant new user → admin with INHERIT + SET so CREATE/ALTER DATABASE
+		// ... OWNER works on PG 16+ (defaults changed to NOINHERIT, NOSET).
+		// Resolve CURRENT_USER to a literal name first (some managed PG
+		// providers reject CURRENT_USER as a special role specifier — XX000).
+		// Aurora/RDS auto-grant these flags via an event trigger; Scaleway
+		// managed RDB and self-hosted PG 16+ do not.
+		var adminRole string
+		if err := c.db.QueryRowContext(ctx, "SELECT current_user").Scan(&adminRole); err != nil {
+			return fmt.Errorf("failed to resolve current admin role: %w", err)
+		}
+		grantQuery := fmt.Sprintf(
+			"GRANT %s TO %s WITH INHERIT TRUE, SET TRUE",
+			quoteIdentifier(username),
+			quoteIdentifier(adminRole),
+		)
+		if _, err := c.db.ExecContext(ctx, grantQuery); err != nil {
+			return fmt.Errorf("failed to grant user to admin (%s): %w", adminRole, err)
 		}
 	}

PG <16 ignores WITH SET TRUE (the column doesn't exist) but does honour WITH INHERIT TRUE, so the statement works on PG 14/15 too. Verified the syntax is parsed cleanly on PG 16.4.

So the running tally of blockers found smoke-testing this branch on Scaleway managed RDB + secretBackend.kubernetes:

  1. Stale --leader-elect flag in config/manager/manager.yaml.
  2. Default secretName (rds/<engine>/<db>) contains /, invalid for kubernetes backend.
  3. RBAC marker for Secrets only get;list;watch.
  4. Catch-22 reconcile when userExists && !secretExists with secretBackend.kubernetes (already addressed in latest pr-135 build with the "rotate password and recreate secret" recovery path — confirmed working).
  5. GRANT %s TO CURRENT_USER rejected as special role specifier (XX000).
  6. (this comment) PG 16+ requires WITH INHERIT TRUE, SET TRUE on the admin grant.

Plus the unrelated build-info banner suggestion in the previous comment.

peter-svensson added a commit that referenced this pull request May 7, 2026
Five fixes surfaced by smoke testing the secretBackend.kubernetes path
against Scaleway managed PostgreSQL:

1. config/manager/manager.yaml: drop --leader-elect arg
   The flag was no longer registered (LeaderElection is hardcoded true
   for safe multi-replica operation), so the binary aborted on the
   unknown flag at startup. Removing the arg matches the new behaviour.
   BREAKING for kustomize users overriding the arg list.

2. internal/controller/database_controller.go: backend-aware default
   for secretName. Kubernetes Secret names reject '/'; the rds/<eng>/<db>
   path-shape was AWS-SM-friendly only. Use rds-<eng>-<db> for the
   kubernetes backend, keep slashes for AWS / Infisical.

3. RBAC marker: secrets verbs now include create/update/patch/delete.
   The kubernetes backend needs to write/delete Secrets, not just read.
   Regenerated config/rbac/role.yaml via 'make manifests'.

4. database_controller.go recovery: when userExists && !secretExists
   (and not a region-change), rotate the password via CreateUser
   (idempotent ALTER USER WITH PASSWORD) and recreate the secret
   instead of aborting with 'cannot recover password'. Previously left
   CRs perma-stuck because the kubernetes backend's finalizer also
   failed on the '/'-in-name issue from #2.

5. internal/database/postgres.go: resolve CURRENT_USER to a literal
   role name before issuing GRANT. Some managed PG providers (Scaleway
   managed RDB confirmed) reject 'GRANT role TO CURRENT_USER' with
   XX000 'cannot use special role specifier'. The literal-name form
   succeeds on RDS and on Scaleway (idempotent — Scaleway's
   _rdb_superadmin trigger already grants membership at CREATE USER).
@peter-svensson peter-svensson force-pushed the refactor/secrets-backend-interface branch from aab79ee to 09304de Compare May 7, 2026 10:34
Adds a backend-agnostic Backend interface (Exists, Get, Create, Update,
Delete, Locator, SyncTags) and a NewBackend factory that picks an
implementation based on the Database CR spec. Existing
AWSSecretsManagerClient gains adapter methods so it satisfies Backend
without renaming any of its current public methods — call sites that
need AWS-specific behaviour (cross-region migration, source-side
GetSecretString) continue to work unchanged.

No behaviour change. Tests still pass.

This is the first of two refactor commits towards supporting additional
secret backends (Kubernetes Secret, Infisical Cloud).
The two AWS-specific Database status fields are folded into a single
generic SecretLocator string. For the AWS backend the locator is the
ARN, which already carries the region; the cross-region migration
logic now derives the previously-used region from the stored ARN via
secrets.AWSRegionFromARN. Other backends (Kubernetes Secret, Infisical
— added in subsequent phases) will populate SecretLocator with their
own identifiers.

The "Region" printcolumn becomes "Locator" with priority=1 (hidden by
default in `kubectl get`, visible with -o wide).

Behaviour preserved:
- Cross-region migration still detects, fetches password from old
  region, creates in new region, deletes old after success.
- Tag-sync, idempotent reconciliation, restore-from-deletion
  unchanged.

This is the second commit of the Phase 1 refactor (Backend interface
extraction). Phase 2 adds the Kubernetes Secret backend; Phase 3 adds
Infisical Cloud.
Output of `make manifests && make helm-crds`. No semantic change beyond
the field rename done in the previous commit; this just refreshes the
CRD to the current controller-gen formatting (description reflow,
Condition struct doc, etc.) and populates config/crd/bases/ which was
empty.
…ionString

BREAKING CHANGE — drops the v0.1.x flat-field shape:

  spec.connectionStringSecretRef       → spec.connectionString.kubernetes
  spec.connectionStringAWSSecretRef    → spec.connectionString.aws
  spec.awsSecretsManager               → spec.secretBackend.aws

The new spec.secretBackend.{aws,kubernetes,infisical} oneOf is the
seam for upcoming backends. Phase 2 fills in spec.secretBackend.kubernetes
(KubernetesSecretBackend already declared). Phase 3 fills in the
Infisical backend (InfisicalSecretBackend already declared with
ProjectSlug/EnvironmentSlug/AuthSecretRef etc).

Validation: the controller's existing oneOf checks for connectionString
remain; secrets.NewBackend returns ErrMultipleBackendsConfigured /
ErrNoBackendConfigured if zero or multiple secretBackend.* are set.

Sample manifests, CRDs, and unit tests updated. AWSSecretsManagerConfig,
SecretKeyReference, AWSSecretReference types removed.

Refs Phase 2 / Phase 3 / Phase 4 release work.
internal/secrets/kubernetes.go implements the Backend interface against
controller-runtime's client.Client, storing the DatabaseSecret JSON
blob under a single 'credentials' key in a Kubernetes Secret. Shape
matches the AWS Secrets Manager backend so consumers can decode the
same JSON regardless of which backend is configured.

- Locator format: <namespace>/<name>
- Description lands in the database.opzkit.io/description annotation
  (Kubernetes Secrets have no native description field)
- Tags are mapped onto Secret labels (caller is responsible for
  ensuring values pass Kubernetes label validation)
- Delete is idempotent on missing secrets (matches AWS impl)
- Create overwrites an existing Secret to mirror AWS's
  restore-from-soft-deletion behaviour
- forceDelete is ignored — Kubernetes Secrets have no soft-delete

Wired into secrets.NewBackend via the new spec.secretBackend.kubernetes
case. The factory signature gains a client.Client parameter (used by
the K8s backend; ignored by AWS and Infisical).

Tests cover create/get/update/delete/exists/syncTags via
controller-runtime's fake client.

Also adds the requested AWSRegionFromARN unit test.

The controller still calls AWSSecretsManagerClient directly for now;
the next commit refactors it to use the Backend interface so
spec.secretBackend.kubernetes works end-to-end.
Replaces direct AWSSecretsManagerClient calls in the destination
secret create/update/delete/exists/get/sync-tags paths with the
generic secrets.Backend interface (constructed via secrets.NewBackend).
spec.secretBackend.kubernetes now works end-to-end against a
Kubernetes Secret in the cluster; spec.secretBackend.aws continues
to behave exactly as before (AWS adapter satisfies the interface).

AWS-specific paths kept direct:
- Cross-region migration in reconcileDatabase + secret reconciliation
  paths (oldRegionClient — only triggered when SecretBackend.AWS is set
  and the locator is an AWS ARN whose region differs from spec)
- Source-side reading from AWS Secrets Manager via
  getConnectionStringFromAWSSecret (the connectionString.aws path)

Removed dead helpers tagsEqual / getTagsToRemove / getTagsToAdd —
each backend implements SyncTags internally. Their unit tests went
with them.

Status field rename: secretARN local var → secretLocator throughout
the create/update flow, populated from backend.Locator() / backend
return values.

Build green, full test suite passes.
internal/secrets/infisical.go implements the Backend interface against
Infisical (Cloud or self-hosted) via Universal Auth. The DatabaseSecret
JSON blob is stored as a single Infisical secret value, matching the
shape used by AWS Secrets Manager and the Kubernetes backend.

- Authentication: Universal Auth (clientId + clientSecret) read from a
  Kubernetes Secret in the same namespace as the Database, referenced
  by spec.secretBackend.infisical.authSecretRef.
- Locator format: infisical://<projectID>/<env><path>/<key>
- Secret-key sanitisation: slashes in the supplied name (default
  rds/<engine>/<db>) are replaced with underscores so the result is a
  valid Infisical secret key.
- Restore-on-create: if the secret already exists, Create overwrites
  it via Update — same semantics as AWS / Kubernetes.
- SyncTags is a no-op: Infisical tags are first-class entities with
  their own UUIDs; mapping AWS-style key/value tags onto them is out
  of scope.
- Versioning: Infisical's V3 secrets API doesn't expose per-secret
  version IDs, so Create/Update return an empty version string.

Wired into secrets.NewBackend via the spec.secretBackend.infisical
case. Factory reads the auth Secret using the K8s client.

CRD: spec.secretBackend.infisical now exposes ProjectID (UUID) and
Environment (slug) instead of the previously declared ProjectSlug /
EnvironmentSlug — the Infisical V3 API requires the UUID for
create/update/delete, slug only works on retrieve.

Tests use an in-memory fake implementing the InfisicalSecretClient
interface, covering create/get/update/delete/exists/sync-tags +
key sanitisation and login-error propagation.

Adds dependency github.com/infisical/go-sdk v0.7.1 (and its transitive
google.golang.org/api / OpenTelemetry / grpc deps — substantial; see
v0.2.0 release notes).

Sample manifest under config/samples/.

Closes Phase 3 of the multi-backend refactor.
Five fixes surfaced by smoke testing the secretBackend.kubernetes path
against Scaleway managed PostgreSQL:

1. config/manager/manager.yaml: drop --leader-elect arg
   The flag was no longer registered (LeaderElection is hardcoded true
   for safe multi-replica operation), so the binary aborted on the
   unknown flag at startup. Removing the arg matches the new behaviour.
   BREAKING for kustomize users overriding the arg list.

2. internal/controller/database_controller.go: backend-aware default
   for secretName. Kubernetes Secret names reject '/'; the rds/<eng>/<db>
   path-shape was AWS-SM-friendly only. Use rds-<eng>-<db> for the
   kubernetes backend, keep slashes for AWS / Infisical.

3. RBAC marker: secrets verbs now include create/update/patch/delete.
   The kubernetes backend needs to write/delete Secrets, not just read.
   Regenerated config/rbac/role.yaml via 'make manifests'.

4. database_controller.go recovery: when userExists && !secretExists
   (and not a region-change), rotate the password via CreateUser
   (idempotent ALTER USER WITH PASSWORD) and recreate the secret
   instead of aborting with 'cannot recover password'. Previously left
   CRs perma-stuck because the kubernetes backend's finalizer also
   failed on the '/'-in-name issue from #2.

5. internal/database/postgres.go: resolve CURRENT_USER to a literal
   role name before issuing GRANT. Some managed PG providers (Scaleway
   managed RDB confirmed) reject 'GRANT role TO CURRENT_USER' with
   XX000 'cannot use special role specifier'. The literal-name form
   succeeds on RDS and on Scaleway (idempotent — Scaleway's
   _rdb_superadmin trigger already grants membership at CREATE USER).
…retBackend

Mass rename across README and docs:
  spec.connectionStringSecretRef    -> spec.connectionString.kubernetes
  spec.connectionStringAWSSecretRef -> spec.connectionString.aws
  spec.awsSecretsManager            -> spec.secretBackend.aws

Also refreshed TROUBLESHOOTING.md prose to reflect that AWS permissions
are now optional (depend on the chosen backend, not always required) and
removed the stale --leader-elect example since the flag is no longer
registered.
PG 16 changed GRANT defaults to NOINHERIT, NOSET, NOADMIN.
CREATE/ALTER DATABASE ... OWNER now requires the executor to
hold SET privilege on the owner role, not just bare membership.

Aurora/RDS paper over this via an event trigger on CREATE ROLE
that grants WITH INHERIT, SET, ADMIN OPTION. Scaleway managed
RDB and self-hosted PG 16+ do not, so the operator failed with
'must be able to SET ROLE' (42501) on those providers.

Add WITH INHERIT TRUE, SET TRUE to the admin-grant statement.
PG <16 honours INHERIT and ignores the SET column cleanly.
PostgreSQL 14+ (tested on 15/16/17), MySQL 8.0+, MariaDB 10.6+.
Also clarify AWS credentials are only required when the AWS
backends (secretBackend.aws or connectionString.aws) are used.
…end shape

Old ConnectionStringSecretRef + bare SecretName fields removed in
158fda5. Integration tests still referenced them and failed to
build. Replace with spec.connectionString.kubernetes (admin DSN
source) and spec.secretBackend.aws (generated-credential dest)
across all 12 createDatabase call sites.
Old test expected phase=Error with message about cannot recover password.
New behaviour (introduced with the smoke-test blocker fix in e619043):
controller rotates the password via ALTER USER and recreates the secret.

Updated test now waits for Ready phase and verifies the secret reappears
in AWS Secrets Manager with the same username and a non-empty password.

@argoyle argoyle left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏🏻

@opzkit-auto-merge opzkit-auto-merge Bot merged commit ccc44a8 into main May 7, 2026
7 checks passed
@opzkit-auto-merge opzkit-auto-merge Bot deleted the refactor/secrets-backend-interface branch May 7, 2026 14:27
peter-svensson added a commit that referenced this pull request May 7, 2026
…tBackend shape

PR #135 merged to main while this PR was open, replacing
spec.connectionStringSecretRef + bare spec.secretName with
spec.connectionString.kubernetes + spec.secretBackend.aws.
peter-svensson added a commit that referenced this pull request May 7, 2026
The WITH INHERIT/SET options on GRANT are PG 16+ syntax. PR #135
emitted them unconditionally, which broke the operator on PG 14
and 15 with 'syntax error at or near INHERIT' (42601), surfaced
by the new PG version matrix.

Detect server_version_num via current_setting() and only append
WITH INHERIT TRUE, SET TRUE on PG 16+. Pre-16 the default GRANT
is sufficient — bare role membership has always implied SET ROLE
on those versions, so CREATE/ALTER DATABASE OWNER works.
opzkit-auto-merge Bot pushed a commit that referenced this pull request May 7, 2026
* test(integration): add PostgreSQL 14/15/16/17/18 compatibility matrix

Deploy 5 versioned PostgreSQL instances (postgres-14 through
postgres-18) in the kind cluster and run a focused matrix test
exercising the CREATE USER + GRANT + CREATE DATABASE OWNER path
for each version.

PG 16+ tightened GRANT defaults to NOINHERIT, NOSET, NOADMIN.
A regression in the operator's admin-grant statement that omits
WITH INHERIT TRUE, SET TRUE will now fail loudly on PG 16/17/18
while still passing on PG 14/15. The matrix is light — one
Database CR per version, asserting Ready phase and secret
presence — so it stays cheap (~1 GB RAM, ~30-60s extra CI time)
while pinning behaviour across the supported version range.

* test(integration): use non-superuser admin in PG version matrix

Connecting as the built-in 'postgres' superuser bypasses the SET
ROLE check entirely (superusers can become any role unconditionally),
so the original matrix passed even on PG 16/17/18 where the bug
class is supposed to bite.

Bootstrap each PG instance with a non-superuser 'dbuo_admin' role
(LOGIN, CREATEDB, CREATEROLE, NOSUPERUSER, NOINHERIT) via a
ConfigMap-mounted /docker-entrypoint-initdb.d/init.sql. Repoint the
per-version connection-string Secrets at this admin. This mirrors
managed-PG providers (Scaleway managed RDB, self-hosted PG 16+
without RDS-style event triggers) where the admin is not a
superuser, so the operator's GRANT path is actually exercised.

* test(integration): adapt PG matrix to new spec.connectionString/secretBackend shape

PR #135 merged to main while this PR was open, replacing
spec.connectionStringSecretRef + bare spec.secretName with
spec.connectionString.kubernetes + spec.secretBackend.aws.

* fix(database): version-detect GRANT clause for PG 14/15 compatibility

The WITH INHERIT/SET options on GRANT are PG 16+ syntax. PR #135
emitted them unconditionally, which broke the operator on PG 14
and 15 with 'syntax error at or near INHERIT' (42601), surfaced
by the new PG version matrix.

Detect server_version_num via current_setting() and only append
WITH INHERIT TRUE, SET TRUE on PG 16+. Pre-16 the default GRANT
is sufficient — bare role membership has always implied SET ROLE
on those versions, so CREATE/ALTER DATABASE OWNER works.

* test(integration): use INHERIT admin in PG matrix init

NOINHERIT prevented the admin from inheriting newuser's ownership
privileges, so COMMENT ON DATABASE and DROP DATABASE failed with
'must be owner of database' (42501) on every PG version after the
operator transferred ownership to the new user.

Real managed-PG providers (AWS rds_superuser, Scaleway
_rdb_superadmin, Aurora) all use INHERIT admin roles, so this
mirrors production more accurately and unblocks the matrix while
still keeping the admin non-superuser.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants